Abstract

We introduce a new procedure for training of artificial neural networks by using the
approximation of an objective function by the arithmetic mean of an ensemble of selected
randomly generated neural networks, and apply this procedure to the classification (or pattern
recognition) problem. This approach differs from the standard one based on optimization
theory. In particular, an individual neural network from the mentioned ensemble need not be an
approximation of the objective function.

1 Introduction

The standard approach to artificial neural networks is based on optimization theory, cf. for
example [1]. An artificial neural network is a composition of neurons (see the next section), which
depends on a set of real parameters (the weights of the neural network). In the pattern recognition
(or classification) problem a neural network is considered as an approximation of an objective
function (the characteristic function of an objective, or target, set) on the training set. In order to find
this approximation, an optimization problem (minimization of the norm of the difference between
the objective function and the neural network on the training set) in the space of parameters of
the neural network is studied.

In the present paper we introduce a new approach to training of (ensembles of) artificial neural 
networks. In this approach instead of optimization of the parameters of a single neural network 
we consider an ensemble of selected neural networks with randomly chosen parameters. A neural 
network is selected if this network has a sufficiently small number of errors at the training set. 
We introduce the averaged neural network which in the simplest case is an arithmetic mean of 
selected neural networks. We show that the averaged neural network can be considered as some 
kind of approximation of the objective function. 

Using the approach introduced in the present paper we are able to avoid two general
problems of the theory of neural networks: the problem of global optimization on a complex landscape
and the problem of overfitting.

The exposition of the present paper is as follows. 

In Section 2 we introduce the necessary notations and discuss the standard approach to training 
of neural networks based on the approximation theory. 
In Section 3 we introduce a procedure based on selection and averaging as a method of
training (ensembles of) neural networks.

In Section 4 we discuss the relation between the construction of the present paper and the theory
of biological evolution.



2 Training of neural networks and approximation theory 

Let us recall some definitions from the theory of artificial neural networks.

A neuron (or single layer perceptron) is a function of real variables of the form

$$f(x_1, \dots, x_N) = \operatorname{sgn}\left(\sum_{i=1}^{N} w_i x_i - \theta\right). \qquad (1)$$

Here $x_i$ are real variables, $(x_1, \dots, x_N)$ takes values in some domain $U \subset \mathbb{R}^N$, $w_i$ are real parameters
(weights of the neuron), $\theta$ is the threshold of activation of the neuron, and the function $\operatorname{sgn}(x) = 1$ for
$x > 0$ and is equal to zero for $x < 0$.

We also consider the smoothed variant of the above neuron, for which instead of the function
sgn we use a smooth monotonically increasing function sgm which varies from zero to unity. In
particular we consider the neuron of the form

$$f(x_1, \dots, x_N) = \operatorname{sgm}\left(\sum_{i=1}^{N} w_i x_i - \theta\right), \qquad \operatorname{sgm}(x) = \frac{1}{1 + e^{-x}}. \qquad (2)$$

A neural network is a composition of the above neurons.
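As a minimal sketch of definitions (1) and (2), the two neuron types can be written in Python as follows. The names `sgn`, `sgm` and `neuron` and the sample weights are illustrative assumptions, and the value of `sgn` at exactly zero is a convention chosen here rather than something fixed by the text.

```python
import math

def sgn(x):
    # Step activation from (1): 1 for positive input, 0 otherwise.
    # The value at exactly x == 0 is an assumption (taken as 0 here).
    return 1 if x > 0 else 0

def sgm(x):
    # Smooth activation from (2): logistic function with values in (0, 1).
    # Note: math.exp can overflow for very negative x; acceptable for a sketch.
    return 1.0 / (1.0 + math.exp(-x))

def neuron(x, w, theta, activation=sgn):
    # Weighted sum of the inputs minus the threshold, passed through the activation.
    s = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return activation(s)

# Hypothetical inputs and weights, chosen only for illustration.
print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))        # step neuron
print(neuron([1.0, 2.0], [0.5, -0.25], 0.1, sgm))   # smoothed neuron
```

The step neuron returns only 0 or 1, while the smoothed neuron returns a value strictly between 0 and 1 that approaches the step value as the weighted sum moves away from the threshold.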
Example. Let us consider a double layer neural network with neurons of the form (1):

$$f(x_1, \dots, x_N) = \operatorname{sgn}\left(\sum_{i=1}^{K} w_i y_i - \theta\right), \qquad y_i = \operatorname{sgn}\left(\sum_{j=1}^{N} w_{ij} x_j - \theta_i\right). \qquad (3)$$

This function is equal to one on some finite family of sets $\{y_i\}$, $y_i = 0, 1$, $i = 1, \dots, K$. For
any set $\{y_i\}$ from this family the corresponding $x_j$ have to satisfy the system of inequalities

$$\sum_{j=1}^{N} w_{ij} x_j > \theta_i, \quad y_i = 1, \qquad \sum_{j=1}^{N} w_{ij} x_j < \theta_i, \quad y_i = 0, \qquad i = 1, \dots, K,$$

i.e. to belong to the intersection of $K$ half-spaces of dimension $N$.

Therefore the above neural network (3) is equal to the characteristic function of a finite union of
intersections of half-spaces.
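The geometric statement above can be checked on a small case. The following sketch, under the assumption of hand-picked (not random) weights, builds a double layer network of the form (3) whose hidden neurons are four half-planes and whose output neuron fires only when all of them fire, so the network is the characteristic function of the open unit square. All names and numerical values are hypothetical.

```python
def sgn(x):
    # Step activation: 1 for positive input, 0 otherwise.
    return 1 if x > 0 else 0

def hidden_layer(x, W, thetas):
    # y_i = sgn(sum_j W[i][j] * x[j] - thetas[i]); each y_i marks one side of a hyperplane.
    return [sgn(sum(wij * xj for wij, xj in zip(row, x)) - th)
            for row, th in zip(W, thetas)]

def two_layer(x, w_out, theta_out, W, thetas):
    # Output neuron applied to the binary vector produced by the hidden layer.
    y = hidden_layer(x, W, thetas)
    return sgn(sum(wi * yi for wi, yi in zip(w_out, y)) - theta_out)

# Half-planes x1 > 0, x1 < 1, x2 > 0, x2 < 1 (illustrative choice).
W = [[1, 0], [-1, 0], [0, 1], [0, -1]]
thetas = [0, -1, 0, -1]
w_out = [1, 1, 1, 1]
theta_out = 3.5  # exceeded only when all four hidden neurons fire

inside = two_layer((0.5, 0.5), w_out, theta_out, W, thetas)
outside = two_layer((1.5, 0.5), w_out, theta_out, W, thetas)
print(inside, outside)
```

Lowering `theta_out` would let the output fire on more patterns $\{y_i\}$, which corresponds to taking a union of several such intersections.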

Let us discuss the classification problem for neural networks. Let the domain $U$ of $f$ be a
union of two parts: the objective set $T$ and the complement $U \setminus T$ of this set.

The classification problem is as follows: to build an approximation of the objective function
(the characteristic function $\chi_T$ of the objective set $T$) by a neural network $f$, i.e. by a composition
of neurons of the form (1) or (2).

Let us recall some definitions of approximation theory. Let $V$ be a normed linear space
and $M \subset V$ be some subset. An approximation of $g \in V$ by $f \in M$ has to satisfy

$$\|g - f\| = \inf_{f' \in M} \|g - f'\|,$$

where $\|\cdot\|$ is the norm in $V$ (i.e. we find the element of $M$ nearest to $g \in V$).

For neural networks the space $V$ is a subset of $L^2(U)$ (with the corresponding norm), and $M$ is the
set of neural networks with a fixed architecture (i.e. the form of the composition of neurons is
fixed but the weights $w_i$ and activation thresholds $\theta$ of the neurons are the parameters of $f \in M$).

Our aim is to approximate the objective function $\chi_T$ by $f \in M$. The problem is that we do
not know the exact form of the objective set $T$. Instead we have the training set $X$: a finite
family of elements $x \in U$ for which we know which of the elements $x \in X$ belong to the objective
set $T$ and which do not.

This implies the following definition: the solution of the classification problem is the neural
network $f$ with the parameters $w_i$, $\theta$ for which the rms (root mean square) deviation of the neural
network from the objective function on the training set is minimal. Therefore the classification
problem takes the form of a global optimization problem in the set of parameters of neural
networks with the given architecture.
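To make the quantity being minimized concrete, here is a small sketch of the rms deviation of a network from the objective function on a training set. The representation of the training set as a list of `(x, label)` pairs, the function name `rms_deviation` and the toy threshold network are assumptions made for illustration.

```python
import math

def rms_deviation(network, training_set):
    # training_set: list of (x, label) pairs, label = 1 if x is in T and 0 otherwise.
    total = sum((network(x) - label) ** 2 for x, label in training_set)
    return math.sqrt(total / len(training_set))

# Hypothetical one-dimensional network: a single step neuron with threshold 0.5.
def toy_network(x):
    return 1 if x[0] > 0.5 else 0

# Hypothetical training set; the third point is misclassified by toy_network.
training_set = [([0.2], 0), ([0.8], 1), ([0.6], 0), ([0.9], 1)]
deviation = rms_deviation(toy_network, training_set)
print(deviation)
```

For networks with values in {0, 1} the squared rms deviation is simply the fraction of training points on which the network makes an error.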

For the investigation of this optimization problem neural networks with smooth neurons of the
form (2) are applied (because optimization methods such as steepest descent are used). Other
approaches to optimization include the Monte Carlo method and simulated annealing.

There are two main problems with training of neural networks in the framework of optimization.
First, global nonlinear optimization is a computationally hard problem in the case of
multiple local minima.

Second, there is the problem of overfitting: our neural network may approximate not the
objective function but the particular choice of the training set.

In the present paper we propose an alternative approach to classification with the help of
neural networks which in some sense is free of the above problems. In this approach instead of 
finding the global minimum of the optimization problem we will take into account the contributions 
from the ensemble of local minima. 

3 Selection and averaging of neural networks 

Let us consider the set of neurons of the form (1) where $w_i$ and $\theta$ are independent real random
variables with some distributions. For simplicity we consider random variables with equal distributions.
Using this ensemble of neurons we build the ensemble of neural networks
$\{f[w, \theta](x_1, \dots, x_N)\}$ as an ensemble of compositions of neurons with independent random parameters
(i.e. the form $f$ of the composition is given and the parameters $w$, $\theta$ are chosen independently
for each neuron).

Let $X$ be the training set. From the ensemble of neural networks described above we choose
randomly a set of neural networks (corresponding to some choices of the
random parameters $w$, $\theta$) in the following way. All neural networks from this set take the required
values on the training set (i.e. these neural networks take values equal to one for $x \in X$ from the
objective set and take values equal to zero for $x \in X$ from the complement of the objective set).
This choice of neural networks corresponds to selection of neural networks on the training set.
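One simple way to realize this selection is rejection sampling: draw random networks and keep only those that reproduce the training labels. The sketch below assumes a double layer architecture, standard normal parameters and a tiny hypothetical training set; none of these choices is prescribed by the text, which only requires independent random parameters.

```python
import random

def sgn(x):
    # Step activation: 1 for positive input, 0 otherwise.
    return 1 if x > 0 else 0

def random_two_layer(n_inputs, n_hidden, rng):
    # Draw all weights and thresholds independently (standard normal is an assumption).
    W = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    thetas = [rng.gauss(0, 1) for _ in range(n_hidden)]
    w_out = [rng.gauss(0, 1) for _ in range(n_hidden)]
    theta_out = rng.gauss(0, 1)

    def f(x):
        y = [sgn(sum(wij * xj for wij, xj in zip(row, x)) - th)
             for row, th in zip(W, thetas)]
        return sgn(sum(wi * yi for wi, yi in zip(w_out, y)) - theta_out)
    return f

def select_networks(training_set, n_selected, n_hidden, rng, max_trials=100_000):
    # Keep a random network only if it takes the required value at every training point.
    n_inputs = len(training_set[0][0])
    selected = []
    for _ in range(max_trials):
        f = random_two_layer(n_inputs, n_hidden, rng)
        if all(f(x) == label for x, label in training_set):
            selected.append(f)
            if len(selected) == n_selected:
                break
    return selected

# Hypothetical training set: label 1 for points in T, 0 otherwise.
training_set = [((1.0, 1.0), 1), ((1.0, 0.5), 1), ((-1.0, -1.0), 0), ((-1.0, -0.5), 0)]
selected = select_networks(training_set, n_selected=5, n_hidden=3, rng=random.Random(0))
print(len(selected))
```

The accepted networks are independent samples from the initial ensemble conditioned on fitting the training set. The acceptance rate can become very small for large training sets, which is one motivation for allowing errors later in this section.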
The ensemble of selected neural networks $f[w, \theta](x_1, \dots, x_N)$ can be described as follows. The
distribution functions for the parameters $w$, $\theta$ of a neural network from the initial ensemble are
multiplied by the characteristic function of the set of parameters for which the corresponding
neural network takes the required values on the training set. After this procedure the joint
distribution function of the parameters $w$, $\theta$ has to be normalized (since the multiplication by
the mentioned characteristic function breaks the normalization condition). The joint distribution
function of the parameters $w$, $\theta$ for a neural network from the initial ensemble is equal to the product
of the distributions of all the parameters (since the parameters are chosen independently). After the
selection procedure the parameters $w$, $\theta$ of a neural network are no longer independent.
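In formulas, the construction just described can be sketched as follows. The notation is introduced here only for illustration: $\xi$ denotes the vector of all parameters $w$, $\theta$ of a network, $p_k$ the density of the $k$-th parameter (assuming densities exist), $S$ the set of parameters that fit the training set, and $p_{\mathrm{sel}}$ the resulting density of the selected ensemble. It is also assumed that $S$ has positive probability under $p$.

```latex
p(\xi) = \prod_{k} p_k(\xi_k), \qquad
S = \{\, \xi : f[\xi](x) = \chi_T(x) \ \text{for all } x \in X \,\}, \qquad
p_{\mathrm{sel}}(\xi) = \frac{\chi_S(\xi)\, p(\xi)}{\int \chi_S(\xi')\, p(\xi')\, d\xi'} .
```

Since $\chi_S$ does not factorize over the individual parameters in general, $p_{\mathrm{sel}}$ is not a product density, which is the loss of independence noted above.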

Definition 1 Let us consider the finite set $f[w^a, \theta^a]$, $a = 1, \dots, n$ of independent selected neural
networks. We introduce the averaged neural network $\langle f \rangle$ as the limit of arithmetic means of
selected neural networks

$$\langle f \rangle(x_1, \dots, x_N) = \lim_{n \to \infty} \langle f \rangle_n(x_1, \dots, x_N) = \lim_{n \to \infty} \frac{1}{n} \sum_{a=1}^{n} f[w^a, \theta^a](x_1, \dots, x_N). \qquad (4)$$

Therefore the averaged neural network takes the required values on the training set. Here the
parameters $w^a$, $\theta^a$ for different selected neural networks are independent (they are not necessarily
independent for a fixed network).

Selected neural networks may have different realizations (which we enumerate by $a$) but as
random functions selected neural networks are equal. Thus the following expectation of a selected
random network (with respect to the distribution of the parameters $w$, $\theta$ described above) will
not depend on $a$:

$$E\left(f[w^a, \theta^a](x_1, \dots, x_N)\right).$$

The main statement of the present paper is that in the limit of large $n$ the averaged neural
network $\langle f \rangle_n$ will be a solution of the classification problem, i.e. it will converge in probability
to a non-random function $\langle f \rangle$ which in some sense can be considered as an approximation of the
objective function (the characteristic function of the objective set) for the classification problem
under consideration.

Proposition 2 Let the expectation and the dispersion of the random function $f[w, \theta](x_1, \dots, x_N)$
exist. Then in the limit $n \to \infty$ the random function $\langle f \rangle_n$ given by (4) converges in probability
pointwise to the non-random function

$$\langle f \rangle(x_1, \dots, x_N) = E\left(f[w, \theta](x_1, \dots, x_N)\right). \qquad (5)$$

Proof The proof is by the law of large numbers. Random functions $f[w^a, \theta^a]$ from the ensemble
of selected neural networks are independent for different $a$ and take values 0 and 1. The dispersions
of these random functions coincide (for fixed arguments $x_1, \dots, x_N$), therefore we can apply the
law of large numbers, which proves the existence of the limit in (4) and (5). □
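The convergence in Proposition 2 can be illustrated numerically. The sketch below makes several simplifying assumptions that are not in the text: the network is a single neuron of the form (1) with one input, the parameters are standard normal, and the training set consists of two points. Under these assumptions the selected ensemble is symmetric in the threshold, so the averaged network at $x = 0$ should be close to $1/2$.

```python
import random

def sgn(x):
    # Step activation: 1 for positive input, 0 otherwise.
    return 1 if x > 0 else 0

def sample_selected_neuron(training_set, rng):
    # Rejection sampling: redraw (w, theta) until the neuron fits the training set.
    while True:
        w, theta = rng.gauss(0, 1), rng.gauss(0, 1)
        f = lambda x, w=w, theta=theta: sgn(w * x - theta)
        if all(f(x) == label for x, label in training_set):
            return f

def averaged_network(networks, x):
    # Arithmetic mean of the selected networks at the point x, as in (4) with finite n.
    return sum(f(x) for f in networks) / len(networks)

training_set = [(-1.0, 0), (1.0, 1)]   # hypothetical: -1 lies outside T, +1 inside T
rng = random.Random(0)
networks = [sample_selected_neuron(training_set, rng) for _ in range(2000)]

for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, averaged_network(networks, x))
```

At the training points the average is exactly 0 or 1, while between them it takes intermediate values that can be read as a degree of confidence that the point belongs to the objective set.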

Here the approximation of the objective function by the function $\langle f \rangle$ is not understood in
the sense of the approximation theory as in the previous section (where the approximation of the 
objective function is the closest function from the family of functions of the given form). 






By construction the averaged neural network $\langle f \rangle$ takes the values 1 and 0 at the elements of
the training set which belong to the objective set and its complement, respectively.

At an element $(x_1, \dots, x_N) \in U$ of the domain $U$ of the neural networks under consideration
which does not belong to the training set $X$, the values of a part of the summands in (4) will be
equal to one and the values of another part will be equal to zero. Therefore for such $(x_1, \dots, x_N)$ the
averaged neural network (4) will take some value from $[0, 1]$.

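Example. Let us consider the double layer neural network

$$f(x_1, \dots, x_N) = \operatorname{sgn}\left(\sum_{i=1}^{K} w_i \operatorname{sgn}\left(\sum_{j=1}^{N} w_{ij} x_j - \theta_i\right) - \theta\right),$$

where all the weights $w_i$, $w_{ij}$, $\theta$, $\theta_i$ are independent random variables. By the definition of a selected
neural network we choose randomly the family of parameters $w_i^a$, $w_{ij}^a$, $\theta^a$, $\theta_i^a$, $a = 1, \dots, n$, for
which the corresponding neural network $f$ takes the required values on the training set $X$.
Therefore the averaged neural network

$$\langle f \rangle_n(x_1, \dots, x_N) = \frac{1}{n} \sum_{a=1}^{n} \operatorname{sgn}\left(\sum_{i=1}^{K} w_i^a \operatorname{sgn}\left(\sum_{j=1}^{N} w_{ij}^a x_j - \theta_i^a\right) - \theta^a\right)$$

will also take the required values on the training set.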

As we discussed for the example in the previous section, any of the summands $f[w^a, \theta^a]$ (double
layer neural networks) in the expression above is equal to the characteristic function of a finite
union of intersections of $K$ half-spaces. Any of these summands can be far (in the sense of
approximation theory) from the characteristic function of the objective set. In particular, it is
possible that the summand $f[w^a, \theta^a]$ is the characteristic function of some polyhedron, and some
part of this polyhedron lies in between the points of the training set which do not belong to the
objective set. In this situation $f[w^a, \theta^a](x_1, \dots, x_N)$ will be equal to one for $(x_1, \dots, x_N)$ from this
part of the mentioned polyhedron, but it is natural to expect that $(x_1, \dots, x_N)$ does not belong to
the objective set.

In the summation over the ensemble of selected neural networks we may have many such cases,
but any of these cases (for the particular $(x_1, \dots, x_N)$ and $f[w^a, \theta^a]$) has low probability since the
random parameters $w^a$, $\theta^a$ for different $a$ are independent. Therefore in the summation in (4)
the corresponding contributions will be small because of the normalization $1/n$. Thus the averaged
neural network will give a better approximation of the objective function (the characteristic function
of the objective set) in comparison to the summands in (4).

The summands $f[w^a, \theta^a]$ in (4) which give the required values at the elements of the training
set correspond, in some approximation, to local minima of the root mean square (rms) deviation
of the neural network $f[w, \theta]$ from the objective function. Therefore in (4) we sum over the local
minima of the rms deviation instead of looking for the global minimum as in optimization
theory.

Therefore the computational problem of finding the global minimum is replaced in our approach
by the problem of finding an ensemble of local minima. We are interested in simplification
of this problem. It is also important to make the definition of the averaged neural network
more robust to errors in the training set. We consider the following generalization of the averaged
neural network.

Generalization of the definition of averaged neural network for the case with errors.

Let us consider a more general ensemble of independent neural networks $\{f[w^a, \theta^a]\}$, $a = 1, \dots$.
Neural networks from this ensemble belong to the initial ensemble of neural networks (without
selection), i.e. these networks may make errors when applied to the elements of the training set $X$
(may be equal to one for $x \in X$ which lies outside the objective set or may be equal to zero for
$x \in X$ which lies inside the objective set). Let the neural network $f[w^a, \theta^a]$ possess $m(f[w^a, \theta^a])$
errors on the training set $X$.

We introduce the averaged neural network as the $n \to \infty$ limit of finite linear combinations of
independent neural networks

$$\langle f \rangle_n(x_1, \dots, x_N) = \left(\sum_{a=1}^{n} e^{-\beta m(f[w^a, \theta^a])}\right)^{-1} \sum_{a=1}^{n} e^{-\beta m(f[w^a, \theta^a])} f[w^a, \theta^a](x_1, \dots, x_N). \qquad (6)$$

In the expression above the averaged neural network is the result of averaging over the Gibbs
ensemble of independent random neural networks with the inverse temperature $\beta > 0$. The
energy of the $a$-th neural network is equal to the number $m(f[w^a, \theta^a])$ of errors of this network on
the training set $X$.

In the limit $n \to \infty$, by the law of large numbers, expression (6) converges in probability to
the Gibbs average

$$\langle f \rangle(x_1, \dots, x_N) = \left(E\left(e^{-\beta m(f[w, \theta])}\right)\right)^{-1} E\left(e^{-\beta m(f[w, \theta])} f[w, \theta](x_1, \dots, x_N)\right).$$

Here E is the expectation with respect to the initial ensemble of random neural networks. 

The neural network (6) is equal to one at the elements of the training set $X$ from the objective
set which are correctly (without errors) recognized by all the elements $\{f[w^a, \theta^a]\}$ of the ensemble
of random neural networks (correspondingly, it is equal to zero for correctly recognized elements of
the training set from the complement of the objective set).

Since we allow errors, the ensemble $\{f[w^a, \theta^a]\}$ contains elements which are easier to generate
in comparison to the case without errors considered earlier. In the limit of zero temperature
$\beta \to \infty$ the expression (6) tends to the averaged neural network without errors (4). We have thus
expressed the selection procedure with the help of averaging over the Gibbs ensemble.
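A minimal sketch of the Gibbs-weighted average (6) for a finite ensemble is given below. The three hand-written "networks" and the two-point training set are hypothetical stand-ins for randomly generated networks, chosen so that the weights can be checked by hand: two networks make one error each and one makes none.

```python
import math

def count_errors(f, training_set):
    # m(f): number of training points at which f takes the wrong value.
    return sum(1 for x, label in training_set if f(x) != label)

def gibbs_average(networks, training_set, beta, x):
    # Weighted mean of the networks at x with weights exp(-beta * m(f)), as in (6).
    weights = [math.exp(-beta * count_errors(f, training_set)) for f in networks]
    return sum(wt * f(x) for wt, f in zip(weights, networks)) / sum(weights)

training_set = [(-1.0, 0), (1.0, 1)]
networks = [
    lambda x: 1,                   # always 1: one error (at x = -1)
    lambda x: 0,                   # always 0: one error (at x = +1)
    lambda x: 1 if x > 0 else 0,   # step: no errors
]

for beta in (0.0, 1.0, 50.0):
    print(beta, gibbs_average(networks, training_set, beta, -1.0))
```

At $\beta = 0$ all networks are weighted equally, and as $\beta$ grows the error-free network dominates, so the value at the training point $x = -1$ tends to the required value 0. This mirrors the zero temperature limit described above.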

Classification by ensembles of neural networks with different architectures. One of
the advantages of the approach proposed in the present paper is the possibility to mix in the
ensembles under consideration neural networks with different architectures, i.e. neural networks
which are different compositions of neurons of the form (1).

Let us consider the ensemble of neural networks containing neural networks with different
architectures $f[w, \theta](x_1, \dots, x_N)$ (these networks will have the same domain, in particular will
depend on the same number of variables, but the form of the neural networks as compositions
of neurons and the number of parameters $w$, $\theta$ may be different for different networks from the
ensemble). Neural networks with a fixed architecture, as earlier, are generated randomly with
independent parameters $w$, $\theta$.
Let us introduce the generalization of the averaged neural network (6) of the form

$$\langle f \rangle(x_1, \dots, x_N) = \lim_{n \to \infty} \frac{\sum_{a=1}^{n} e^{-\beta\left(m(f^{c(a)}[w^a, \theta^a]) + k(c(a))\right)} f^{c(a)}[w^a, \theta^a](x_1, \dots, x_N)}{\sum_{a=1}^{n} e^{-\beta\left(m(f^{c(a)}[w^a, \theta^a]) + k(c(a))\right)}}. \qquad (7)$$

Here the index $c$ enumerates the different architectures of neural networks $f^c$ (this index takes
a finite number of values), $c(a)$ is the random architecture of the $a$-th randomly chosen neural
network, the function $k(c)$ of complexity of the network takes positive values and increases sufficiently
fast with increasing complexity of the neural network $f^c$ (in particular one can take
$k(c)$ to be equal to the number of neurons in the network), and the other notations have the same
meaning as in (6).

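In the limit $n \to \infty$ expression (7) will converge in probability to the Gibbs average

$$\langle f \rangle(x_1, \dots, x_N) = \frac{E\left(e^{-\beta\left(m(f^{c}[w, \theta]) + k(c)\right)} f^{c}[w, \theta](x_1, \dots, x_N)\right)}{E\left(e^{-\beta\left(m(f^{c}[w, \theta]) + k(c)\right)}\right)},$$

where the expectation is taken over the random architecture $c$ and the random parameters $w$, $\theta$.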

For low temperature (large $\beta$) the main contribution to expression (7) comes from neural networks
which have sufficiently simple architectures and are able to solve the classification problem
(i.e. to give a considerable number of contributions, enumerated by the index $a$, with a small number
of errors to expression (7)). Since (7) contains contributions from neural networks with
different (in particular simple) architectures, this will help to reduce the problem of overfitting of
a neural network, that is, optimization of a neural network of unnecessarily complicated architecture for
the particular form of the training set, which may cause errors for a different training set with the
same objective function.
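The effect of the complexity term can be seen in a small sketch of the finite sum in (7). The ensemble below is hypothetical: it pairs each network with a complexity value $k(c)$ (taken here as an assumed neuron count), and both networks fit the training set but disagree at the point $x = 0.5$. Subtracting the minimal energy before exponentiating is a numerical stability device added here; it does not change the ratio.

```python
import math

def count_errors(f, training_set):
    # m(f): number of training points at which f takes the wrong value.
    return sum(1 for x, label in training_set if f(x) != label)

def gibbs_average_mixed(ensemble, training_set, beta, x):
    # ensemble: list of (network, complexity) pairs; energy = errors + complexity.
    energies = [count_errors(f, training_set) + k for f, k in ensemble]
    e_min = min(energies)  # shift energies to avoid underflow of all weights
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    values = [f(x) for f, _ in ensemble]
    return sum(wt * v for wt, v in zip(weights, values)) / sum(weights)

training_set = [(-1.0, 0), (1.0, 1)]
ensemble = [
    (lambda x: 1 if x > 0 else 0, 1),                  # simple network, no errors
    (lambda x: 1 if (x > 0 and x != 0.5) else 0, 5),   # contrived complex network, no errors
]

value = gibbs_average_mixed(ensemble, training_set, beta=1.0, x=0.5)
print(value)
```

Both networks have zero errors, so only the complexity term separates them. The simple network receives the larger weight and the average at $x = 0.5$ stays close to its value, which is the mechanism by which (7) is expected to reduce overfitting.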



4 Discussion 



The standard approach to the classification (or pattern recognition) problem with neural networks 
is as follows: we choose the architecture of the neural network and then find the parameters of the 
network which give a better approximation of the objective function, i.e. solve the optimization 
problem. 

In the present paper we propose an alternative approach: instead of optimizing the particular
neural network we consider the Gibbs ensemble of neural networks with different architectures and 
energy equal to the sum of the number of errors of the neural network at the training set and some 
increasing function of complexity of the neural network. Then for sufficiently low temperatures 
the Gibbs average over the ensemble of neural networks will give the solution of the classification 
problem. 

Let us discuss the following analogy with the theory of biological evolution. In accordance with
the modern approach in evolution theory, the so-called postmodern synthesis [2], biological systems are
ensembles of replicators (in particular genes). In [2] it is stressed that it is necessary to consider
genomes from the point of view of statistical physics applied to genomic sequences.

As a development of this approach we propose to take into account the computational aspect 
of genomic sequences i.e. to consider genomes as ensembles of some simple algorithms. Any of 
these algorithms is a replicator (for example a gene). The key question in this approach will be 
the description of gene regulation as interaction of algorithms in the ensemble. 
Why can some ensemble of algorithms function as a single algorithm? In particular, it is
interesting to construct the simplest example of such an ensemble of algorithms which solves some
problem.

The second question is why biological evolution is possible, i.e. why selection and other manipulations
with statistical ensembles can generate sufficiently complex algorithms starting from an
ensemble of elementary algorithms.

In the present paper we have constructed the ensemble of neural networks which solves the 
classification problem. Replication and mutation of the set of neural networks of different archi- 
tectures were used, and the selection procedure with the help of the Gibbs ensemble described 
above was applied. 

Let us note that in the standard approach to selection in evolution theory selection is considered 
as optimization procedure — one has to select elements with higher fitness. The main point of the 
approach of the present paper is the classification by an ensemble which contains neural networks 
with sufficiently different properties. 

The method introduced in this paper can be applied to the description of the evolution of a genome
as an ensemble of algorithms. This approach will be a version of the theory of group selection
applied not to a population of individuals but to a genome as an ensemble of replicators.

The approach to genomes as probabilistic algorithms, in particular the modeling of gene duplication
by an analogue of the replica procedure applied in the theory of spin glasses [3], was
proposed in [1].